Image retrieval system.



BACKGROUND OF THE INVENTION 

The invention relates to an image retrieval system comprising:
- a database with candidate images,
- entry means for entering a query image,
- comparison means for comparing the query image with one of the
candidate images, and
- presentation means for presenting at least the candidate image with the
largest similarity with the query image.

The invention further relates to a method for retrieving images from a
database with candidate images, the method comprising the steps of:
- inputting a query image;
- comparing the query image with candidate images to establish respective
similarities between these candidate images and the query image; and
- presenting at least the candidate image with the largest image similarity
with the query image.

The invention further relates to a method for organizing images in a database.

The invention further relates to a system for organizing images in a database.

The invention further relates to a database with a plurality of images. 

Image retrieval systems are of importance for applications that involve
large collections of images. Professional applications include broadcast stations, where a piece
of video may be identified through a set of shots and where a shot of video is to be
retrieved according to a given image. Movie producers, too, must be able to find scenes
from among a large number of scenes. Furthermore, art museums have large collections of
images of their paintings, photos and drawings, and must be able to retrieve images on
the basis of some criterion with respect to their contents. Consumer applications include
maintaining collections of slides, photos and videos, from which the user must be able to
retrieve items, e.g. on the basis of similarity with a specified query image.

An image retrieval system and a method as described above are known
from the article "Tools and Techniques for Color Image Retrieval", John R. Smith and
Shih-Fu Chang, Proc. SPIE - Int. Soc. Opt. Eng (USA), Vol. 2670, pp. 426-437. The image
retrieval system comprises a database with a large number of images. A user searching for a
particular image specifies a query image indicating what the retrieved image or images should
look like. The system then compares the stored images with the query image and ranks the stored
images according to their similarity with the query image. The ranking results are presented
to the user, who may retrieve one or more of the images. The comparison of the query image
with a stored image to determine the similarity may be based on a number of features
derived from the respective images. The image feature or features used for comparison are
called a feature vector. The article describes the usage of a color histogram as such a feature
vector. When using the RGB (Red, Green and Blue) representation of an image, a color
histogram is computed by quantizing the colors within the image and counting the number of
pixels of each color. To determine the similarity, a number of techniques are described to
compare the two color histograms of the respective images. An example of such a technique is
the histogram intersection, where the similarity is the sum over all histogram bins of the
minimal value of the pair of corresponding bins of the two histograms.
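
As an illustration of the color histogram and the histogram intersection described above, the following Python sketch quantizes RGB pixels into bins and sums the bin-wise minima. The function names, the choice of 4 quantization levels per channel and the normalization of the histogram by the pixel count are assumptions made for this example; they are not taken from the cited article.

```python
from collections import Counter


def color_histogram(pixels, levels=4):
    """Quantize 8-bit RGB pixels into levels**3 bins and return normalized counts.

    `pixels` is a non-empty list of (r, g, b) tuples with values in 0..255.
    Normalizing by the pixel count is an assumption of this sketch.
    """
    counts = Counter()
    for r, g, b in pixels:
        # Map each channel to a quantization level in 0..levels-1.
        rq, gq, bq = (r * levels // 256, g * levels // 256, b * levels // 256)
        counts[(rq * levels + gq) * levels + bq] += 1
    total = len(pixels)
    return [counts[i] / total for i in range(levels ** 3)]


def histogram_intersection(h1, h2):
    """Sum over all bins of the smaller of the two corresponding bin values."""
    return sum(min(a, b) for a, b in zip(h1, h2))
```

With normalized histograms the intersection lies between 0 (no bin in common) and 1 (identical histograms), which makes values comparable between images of different size.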

In a practical setup, the number of images can be very large. On the
Internet, for example, the number of images can be of the order of millions and is ever
growing. Even if the time to compare the query image with a candidate image is very short,
the cumulative time needed to compare the query image with all images in the database will
be long. It is a drawback of the known system that a user searching for an image in such a
large database must wait a long time after having submitted the query image to the system.

SUMMARY OF THE INVENTION

It is an object of the invention to provide an image retrieval system of the
kind set forth in which the time for finding candidate images similar with the query image is
reduced. This object is achieved according to the invention in an image retrieval system
comprising:
- a database with clusters, each cluster comprising a respective set of
candidate images and a cluster center which is representative for that set;
- entry means for entering a query image;
- cluster comparison means for comparing the query image with respective
cluster centers to establish respective cluster similarities between the query image and the
respective clusters;
- selection means for selecting at least the cluster with the largest cluster
similarity with the query image;
- image comparison means for comparing the query image with the
candidate images in the selected clusters to establish respective image similarities between the
query image and the respective candidate images; and
- presentation means for presenting at least the candidate image with the
largest image similarity.

By selecting one or more clusters that are most similar with the query image and
subsequently comparing the query image with only the candidate images in the selected
clusters, fewer comparisons are needed. This reduces the time needed to find the candidate
images that are similar with the query image. Since the number of clusters is much smaller
than the number of images, the number of additional comparisons needed to compare the query
image with the clusters is much smaller than the number of comparisons saved by
not comparing the query image with the images in the clusters that are not selected. Clustering
of the candidate images according to their similarity does not require the presence of
any query image. Therefore, the clustering is done in advance and not at the time the
user is actually searching for images on the basis of the query image, so the time needed to
cluster the images does not add to the waiting time the user of the system experiences when
searching.

An embodiment of the image retrieval system according to the invention is 
defined in Claim 2. The similarity between images may be determined on the basis of their 
color histograms. The average of the respective histograms of a number of representative 
images of a cluster can advantageously be used as a representation for the whole cluster. 

It is a further object of the invention to provide a method for retrieving
images of the kind set forth with a reduced time for finding candidate images similar with the
query image. This object is achieved according to the invention in a method for retrieving
images from a database comprising clusters, each cluster comprising a respective set of
candidate images and a cluster center which is representative for that set, the method
comprising the steps of:
- inputting a query image;
- comparing the query image with respective cluster centers to establish
respective cluster similarities between the clusters and the query image;
- selecting at least the cluster with the largest cluster similarity with the
query image;
- comparing the query image with respective candidate images of the
selected clusters to establish respective image similarities between these candidate images and
the query image; and
- presenting at least the candidate image with the largest image similarity.

By first determining which of the clusters are similar with the query image and by 
subsequently comparing the query image with only the images in those clusters, far fewer 
comparisons are needed. This greatly reduces the time needed to find the candidate images 
that are similar with the query image. 

It is a further object of the invention to provide a method for organizing
images in a database, such that the resulting database allows images that are similar with a
given query image to be found in a reduced time. This object is achieved according to the
invention in a method for organizing images in a database, the method comprising the steps of:
- defining clusters each comprising a subset of the images, whereby the
images in a cluster are similar with each other and whereby at least one of the clusters
comprises more than one image, and
- determining a cluster center for each of the clusters.
By grouping mutually similar images in respective clusters and by defining respective cluster
centers for these clusters, a subsequent search for images that are similar with a given query
image can be performed more quickly. The search can first determine on the basis of the
cluster centers which of the clusters might contain images that are similar with the query
image. Subsequently the search may limit the further comparisons between the query image
and the images in the database to these clusters. Consequently fewer comparisons are needed,
resulting in a shorter time for finding the similar images.

An embodiment of the method for organizing images in a database
according to the invention is defined in claim 5. Determining which two of all the
clusters are most similar with each other and merging these two clusters into a new
cluster is a good procedure for creating a database with clusters whereby a cluster comprises
mutually similar images. This procedure may be executed repeatedly, each time merging the
two most similar clusters into a new cluster and thereby reducing the number of clusters by
one, until a required number of clusters has been reached or until the similarity between the
two most similar clusters has dropped below a given threshold.

An embodiment of the method for organizing images in a database
according to the invention is defined in claim 6. The average of the similarities between all
pairs of images in two clusters is a good measure for the similarity between those two
clusters, since every image in both clusters contributes to this measure.

An embodiment of the method for organizing images in a database
according to the invention is defined in claim 7. When the cluster center of a particular
cluster is determined on the basis of a few images of that particular cluster, it is
advantageous to select for this purpose respective images from the clusters that were merged
into this particular cluster. The fact that these clusters were still separate at some earlier
stage indicates that an image of one of the clusters is less similar with an image of the other
cluster than with an image from its own cluster. So selecting an image from each of the clusters
gives a better representation of the diversity of the images in the particular cluster that
resulted from merging the clusters.

An embodiment of the method for organizing images in a database
according to the invention is defined in claim 8. Since the cluster center may be based on
only a number of representative images, an image may exist that is more similar with a
cluster center of another cluster than with the cluster center of its own cluster. If it is
determined that one or more such images exist, then these images are moved to the
respective other clusters, thus creating an optimized organization of images into clusters with
cluster centers. This step of moving the images may be followed by a recomputation of the
cluster centers of the clusters involved, i.e. the clusters from which an image is moved and
the clusters to which an image is moved, and by again checking whether one or more images
exist that are more similar with another cluster center than with their own. These steps may be
repeatedly executed until the number of images to be moved is below a given threshold.

It is a further object of the invention to provide a system for organizing
images in a database, such that the resulting database allows images that are similar with a
given query image to be found in a reduced time. This object is achieved according to the
invention in a system for organizing images in a database, the system comprising:
- clustering means for defining clusters each comprising a subset of the
images, whereby the images in a cluster are similar with each other and whereby at least one
of the clusters comprises more than one image, and
- center determining means for determining a cluster center for each of the
clusters.
By grouping mutually similar images in respective clusters and by defining respective cluster
centers for these clusters, an organization of the images is made through which a subsequent
search for images that are similar with a given query image can be performed more quickly.
A first step of the search determines which of the clusters may contain images similar with
the query image. Then a second step of the search compares the query image with the images
in these clusters. This greatly reduces the number of comparisons needed to find images that
are similar with the given query image.

It is a further object of the invention to provide a database with a plurality
of images in an organization that allows images that are similar with a given query
image to be found in a reduced time. This object is achieved according to the invention in a
database with a plurality of images, the database comprising:
- clusters each comprising a subset of the images, whereby the images in a
cluster are similar with each other and whereby at least one of the clusters comprises more
than one image, and
- a cluster center for each of the clusters.
The grouping of mutually similar images in respective clusters and the respective cluster
centers for these clusters make it possible for a subsequent search for images that are similar
with a given query image to be performed more quickly.

BRIEF DESCRIPTION OF THE DRAWINGS 

The invention and its attendant advantages will be further elucidated with
the aid of exemplary embodiments and the accompanying schematic drawings, wherein:

Figure 1 schematically shows an image retrieval system according to the
invention,

Figure 2 shows a simple example of organizing images into clusters
according to the invention,

Figure 3 shows an example of the removal of an image which is a direct
child of the root of the cluster,

Figure 4 shows an example of the removal of an image which is not a
direct child of the root,

Figure 5 shows an example of the addition of an image to a cluster,

Figure 6 shows the most important components of the image retrieval
system according to the invention, and

Figure 7 shows the most important components of the system for
organizing the images according to the invention.

Corresponding features in the various Figures are denoted by the same
reference symbols.

DESCRIPTION OF THE PREFERRED EMBODIMENTS 

Figure 1 schematically shows an image retrieval system according to the 
invention. The system 100 comprises a database 102 with a potentially large collection of 
candidate images. A purpose of the system is to retrieve from the collection one or more 
images that match the wishes of a user of the system. The system performs a content based 
search in the collection of the images, i.e. the content of an image is used as the search or 
ranking criterion, as opposed to systems that search on the basis of keywords in annotation 
added to the images. The images in the database according to the invention are grouped in 
clusters, of which clusters 104, 106 and 108 are shown. Images of a cluster are to a certain 
extent similar with each other. For instance cluster 108 contains images 110, 112, 114 and 
116 which are according to a certain measure similar with each other. The content of an 
image is represented in the system by a so-called feature vector, e.g. image 116 has a feature 
vector 118. In the system according to the invention, a color histogram of the image is used 
as feature vector but the type of feature vector is not essential to the invention and other 
measures expressing the characteristics of the content of an image may be used. The feature 
vector may be stored in the database with the image itself or at some other location in the 
database, e.g. in a table with feature vectors of other images including a reference to the 
image. A cluster has a cluster center representing the contained images, e.g. cluster 108 has 
cluster center 120. In the system according to the invention, the cluster center is the average 
of the color histograms of a number of representative images in the cluster. Another kind of
cluster center may be used, e.g. the feature vector of a single image which is chosen as the
representative image for all images in the cluster.

The system further comprises an entry unit 122 through which the user
enters a query image 124. The entry unit may allow the user to compose the query image
from a number of existing images or to create the query image from scratch. The entry unit
may include a scanning device for producing a digital image from an image available on
paper or some other device for producing a digital image. The entry unit determines for the
query image 124 a feature vector 126 expressing the contents. In order to determine how
similar the query image is with an image in the database, the feature vectors of both images
are used to calculate a similarity measure. The system 100 further comprises a cluster
comparison unit 128 for comparing the query image 124 with the clusters in the database.
For each of the clusters, the cluster comparison unit 128 calculates a cluster similarity 130
on the basis of the feature vector 126 of the query image and the respective cluster center,
e.g. 120 of cluster 108. The cluster similarity 130 is a measure of how similar the query image
is with the particular cluster center and, through that, with the images in that cluster. A
selection unit 132 selects on the basis of the calculated cluster similarities 130 a list 134 of
clusters with the highest similarity. Experiments have shown that the selection of the 10 most
similar clusters out of a total of 133 clusters, with an average of 15 images per cluster,
provides for a good overall retrieval accuracy. Subsequently, an image comparison unit 136
compares the query image 124 with each of the candidate images in the selected clusters. For
each such candidate image, an image similarity 138 is calculated on the basis of the feature
vector 126 of the query image and the feature vector of the particular candidate image.

Finally, the candidate images that are most similar with the query image
are presented to the user by a presentation unit 140. The presentation unit 140 presents a list
142 of images ranked with respect to the calculated image similarities 138. Furthermore, it is
to be noted that, as an alternative to being determined by the entry unit 122, the feature
vector 126 may be determined by the comparison unit 128 or by some other unit specifically
arranged for that purpose.
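
The two-stage search of Figure 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the system: the dictionary layout of a cluster, the function names, the default parameter values and the use of histogram intersection as the similarity are all choices made for the example.

```python
def histogram_intersection(h1, h2):
    """Sum over all bins of the smaller of the two corresponding bin values."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def cluster_center(histograms):
    """Average of the histograms of the representative images of a cluster."""
    n = len(histograms)
    return [sum(column) / n for column in zip(*histograms)]


def retrieve(query_hist, clusters, top_clusters=10, top_images=5,
             similarity=histogram_intersection):
    """Return the ids of the most similar images, searching selected clusters only.

    `clusters` is a list of dicts {"center": histogram, "images": {id: histogram}};
    this layout is hypothetical.
    """
    # Stage 1: compare the query with the cluster centers only.
    ranked = sorted(clusters,
                    key=lambda c: similarity(query_hist, c["center"]),
                    reverse=True)
    selected = ranked[:top_clusters]
    # Stage 2: compare the query with the images of the selected clusters.
    scored = [(similarity(query_hist, hist), image_id)
              for cluster in selected
              for image_id, hist in cluster["images"].items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [image_id for _, image_id in scored[:top_images]]
```

With the figures quoted above (133 clusters, on average 15 images per cluster, 10 clusters selected), a query needs roughly 133 + 10 × 15 = 283 comparisons instead of about 133 × 15 = 1995 for an exhaustive search.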

In an embodiment of the image retrieval system according to the
invention, the feature vector of an image is its color histogram. The similarity measure
between two images is calculated on the basis of the two color histograms of these images by
determining the so-called histogram intersection. This technique is described in the article
"Tools and Techniques for Color Image Retrieval", John R. Smith and Shih-Fu Chang,
Proc. SPIE - Int. Soc. Opt. Eng (USA), Vol. 2670, pp. 426-437.

In a further embodiment an alternative to the histogram intersection
technique is used by treating the two histograms as two probability distributions. The
question of how similar the two histograms are can then be answered by measuring how
different the one probability distribution is from the other. This difference between two
statistical distributions is called informational divergence or Kullback informational
divergence and is calculated with the following equation:

$$D(Q \| P) = \sum_{x \in X} Q(x) \log \frac{Q(x)}{P(x)} \qquad (1)$$

In which:

Q(x) is the normalized query color histogram,

P(x) is the normalized candidate color histogram, and

D(Q || P) is the Kullback informational divergence.

A more detailed discussion on the Kullback informational divergence is
presented in the textbook "Information Theory: Coding Theorems for Discrete Memoryless
Systems", I. Csiszár and J. Körner, Akadémiai Kiadó, Budapest, 1981, pages 19-22.

Equation (1) can be rewritten as

$$D(Q \| P) = \sum_{x \in X} Q(x) \log Q(x) - \sum_{x \in X} Q(x) \log P(x) \qquad (2)$$

The first term in equation (2) is the negative of the entropy of distribution Q(x) and is
fully determined by the contents of the query. Therefore this first term is the same for all
candidate images of the database and need not be considered when ranking the candidate images
with respect to similarity to the query image. Dropping this term and reversing the sign of the
remainder, so that a larger value corresponds to a smaller divergence, the similarity between
the candidate image and the query image is, according to this first embodiment of the image
retrieval system according to the invention, calculated with the following equation:

$$S_k(Q, P) = \sum_{x \in X} Q(x) \log P(x) \qquad (3)$$



In which: 

Sk(Q,P) is the similarity between the candidate image and the query image, 
Q(x) is the normalized query color histogram, and 
P(x) is the normalized candidate color histogram. 

The value of Sk(Q,P) is used to rank the candidate images with respect to
their similarity with the query image. A relatively large value indicates that two images are
similar and a relatively low value indicates that two images are dissimilar.
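
A minimal Python sketch of the similarity of equation (3) is given below. The function name and the small constant `eps`, which guards against taking the logarithm of an empty candidate bin, are assumptions of this example; the source does not say how empty bins are handled.

```python
import math


def kullback_similarity(q, p, eps=1e-12):
    """Sum over all bins of Q(x) * log P(x); a larger value means more similar.

    `q` and `p` are normalized histograms of equal length. Bins with Q(x) = 0
    contribute nothing; `eps` (an assumption) replaces empty candidate bins.
    """
    return sum(qx * math.log(max(px, eps)) for qx, px in zip(q, p) if qx > 0)
```

For a fixed query histogram, the value is largest when the candidate histogram equals the query histogram, so ranking by this value is equivalent to ranking by increasing divergence.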

In a still further embodiment of the image retrieval system according to
the invention, similarity coefficients are determined for each pair of corresponding bins of
the two color histograms between which a similarity must be determined. Subsequently the
obtained collection of similarity coefficients is treated as a probability distribution and the
question of how similar the two histograms are is then answered by analyzing this
probability distribution. In this embodiment, the similarity coefficients are calculated using
the following equation:

$$r_i(P, Q) = \frac{\min(p_i, q_i)}{\max(p_i, q_i)} \qquad (4)$$

In which:

r_i(P,Q) is the similarity coefficient between bin i of the candidate color histogram and bin i
of the query color histogram,
p_i is the number of pixels in bin i of the candidate color histogram, and
q_i is the number of pixels in bin i of the query color histogram.

Especially in cases where the candidate images in the database have
significantly different color histograms, comparison on the basis of the similarity coefficients
as such is not sufficient. Therefore the distribution of the similarity coefficients over the bins
is analyzed. First the distribution is normalized using the following equation:

$$s_i = \frac{r_i}{\sum_{j=0}^{N-1} r_j}, \quad i \in [0, N-1] \qquad (5)$$

In which:

s_i is an element of the normalized probability distribution S,
r_i is calculated using equation (4), and
N is the number of bins.

The flatness of the distribution S is used in addition to the similarity
coefficients themselves for determining the similarity between the candidate color histogram
and the query color histogram. A flat distribution indicates a good overall match, while one
with a few peaks indicates a good match over a few bins. The level of flatness of the
probability distribution S is measured by calculating its entropy using the following equation:

$$H(S) = -\sum_{i=0}^{N-1} s_i \log s_i \qquad (6)$$

In which:

H(S) is the entropy of distribution S,
s_i is an element of the distribution S, calculated using equation (5), and
N is the number of bins.



H(S) lies in the range [0, log(N)]. H(S) = log(N) indicates that the
similarity coefficients of all bins are equal, i.e. r_i = r_j for all i, j in [0, N-1]. The value
H(S) = 0 indicates that there is at most one histogram bin over which the histograms P and
Q are similar. In this embodiment of the image retrieval system according to the invention,
the similarity is obtained by combining the entropy H(S) and the sum of the similarity
coefficients using the following equation:

$$S_e(Q, P) = H(S) \sum_{i=0}^{N-1} r_i \qquad (7)$$

In which:

Se(Q,P) is the similarity between the candidate image and the query image,
H(S) is the entropy according to equation (6), and
r_i is the similarity coefficient according to equation (4).

Se(Q,P) lies in the range [0, N log(N)]. A larger value of Se(Q,P) indicates
a higher similarity between the candidate color histogram P and the query color histogram Q.
If Se(Q,P) = 0, P and Q are very dissimilar. If Se(Q,P) = N log(N), P and Q are identical.
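
The entropy-based similarity can be sketched in Python as shown below. Several points are assumptions of this example rather than statements of the source: the similarity coefficient of a bin is taken as the ratio of the smaller to the larger pixel count, the coefficients are normalized by their sum, the final value is the product of the entropy and the sum of the coefficients (which reproduces the stated range from 0 to N log(N)), and a bin that is empty in both histograms is counted as a perfect match so that identical histograms reach the maximum.

```python
import math


def entropy_similarity(p, q):
    """Entropy of the normalized similarity coefficients times their sum.

    `p` and `q` are pixel counts per bin of the candidate and query histograms.
    """
    # Per-bin similarity coefficient in [0, 1]; two empty bins count as 1 (assumption).
    r = [1.0 if max(pi, qi) == 0 else min(pi, qi) / max(pi, qi)
         for pi, qi in zip(p, q)]
    total = sum(r)
    if total == 0:
        return 0.0  # no bin matches at all
    # Normalize the coefficients into a probability distribution S.
    s = [ri / total for ri in r]
    # Entropy of S measures how evenly the match is spread over the bins.
    entropy = -sum(si * math.log(si) for si in s if si > 0)
    return entropy * total
```

A match concentrated in a single bin gives an entropy of 0 and therefore a similarity of 0, however good the match in that bin is.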

In the embodiments of the image retrieval system described above, a
single color histogram is made from the whole image. Because of this, the spatial
information from the image is lost and the comparison of two images reflects only global
similarity. For example, if a user enters a query image with a sky at the top and sand at the
bottom, the retrieved images are expected to have a mix of blue and beige, but not
necessarily a sky and sand. A desirable result for the retrieved candidate images would be
images with blue at the top and beige at the bottom. In order to achieve this result, a further
embodiment of the system according to the invention determines a color histogram for a
number of respective regions of the query image and compares these determined histograms
with histograms of corresponding regions of the candidate image. The query image may be
divided into regions using pre-fixed boundaries, e.g. the division of the image into a number
of rectangles. Furthermore, the regions may be indicated manually by the user, taking into
account important objects in the query image. In this way, the user ensures that a histogram is
made for a region comprising the object of interest. The choice of the region size is
important since it governs the emphasis that is given to local information. In one extreme,
the whole image is considered as a single region so that only global information is used for
the comparison. In the other extreme, the region size matches the individual pixels. In one of
the further embodiments of the retrieval system according to the invention, the images are
divided into 4x4 rectangular regions.

Combining the region similarities corresponding to the respective regions
of the query image and the candidate image into an overall similarity should avoid putting too
much emphasis on any one of the region similarities. Therefore, the embodiments of the
system according to the invention with multiple color histograms per image use the median
of the region similarities as a measure of the similarity for the whole image. In the
embodiment of the system using the Kullback informational divergence, the overall similarity
between the candidate image and the query image, based on similarities of respective regions
of the images, is calculated according to the following equation:

$$S_K(I_Q, I_P) = \operatorname*{Median}_{k,l \in [0, M-1]} \{ S_K(Q_{kl}, P_{kl}) \} \qquad (8)$$

In which:

I_Q is the query image,

I_P is the particular candidate image,

S_K(I_Q, I_P) is the overall similarity between images I_P and I_Q,

Q_kl is the color histogram of region k,l of the query image,

P_kl is the color histogram of region k,l of the particular candidate image,

S_K(Q_kl, P_kl) is the similarity between region k,l of the candidate image and region k,l of the
query image, based on the Kullback informational divergence according to equation (3), and

M is the number of regions into which the image is divided in the horizontal and in the
vertical direction.

The median function sorts the individual region similarities and selects the middle one to be
the overall similarity.

In the embodiment of the system using the entropy measure, the overall
similarity between the candidate image and the query image, based on similarities of
respective regions of the images, is calculated according to the following equation:

$$S_e(I_Q, I_P) = \operatorname*{Median}_{k,l \in [0, M-1]} \{ S_e(Q_{kl}, P_{kl}) \} \qquad (9)$$

In which:

I_Q is the query image,

I_P is the particular candidate image,

S_e(I_Q, I_P) is the overall similarity between images I_P and I_Q,

Q_kl is the color histogram of region k,l of the query image,

P_kl is the color histogram of region k,l of the particular candidate image,

S_e(Q_kl, P_kl) is the similarity between region k,l of the candidate image and region k,l of the
query image, based on the entropy measure according to equation (7), and

M is the number of regions into which the image is divided in the horizontal and in the
vertical direction.
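
The region-based comparison with the median as overall similarity can be sketched in Python as follows. The representation of an image as a list of pixel rows, the function names and the `histogram` and `similarity` callables passed in by the caller are hypothetical. Because an M x M division with even M yields an even number of regions, there is no single middle value; this sketch uses `statistics.median_low`, which returns the lower of the two middle values, as one possible reading of "selects the middle one".

```python
import statistics


def split_regions(image, m):
    """Split a 2-D grid of pixels (a list of rows) into m x m rectangular regions."""
    rows, cols = len(image), len(image[0])
    regions = []
    for k in range(m):          # vertical region index
        for l in range(m):      # horizontal region index
            regions.append([
                image[y][x]
                for y in range(k * rows // m, (k + 1) * rows // m)
                for x in range(l * cols // m, (l + 1) * cols // m)
            ])
    return regions


def overall_similarity(query, candidate, m, histogram, similarity):
    """Median of the similarities between corresponding regions of two images."""
    region_similarities = [
        similarity(histogram(rq), histogram(rc))
        for rq, rc in zip(split_regions(query, m), split_regions(candidate, m))
    ]
    # median_low always returns one of the region similarities themselves.
    return statistics.median_low(region_similarities)
```

Any per-region measure can be supplied as `similarity`, so the same function covers both the divergence-based and the entropy-based embodiment.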

The images in the database of the retrieval system according to the
invention are organized into clusters so as to allow a search for images similar with a given
query image without the need to compare all images with the query image. According to
the invention, clusters of images are defined whereby similar images are grouped in the same
cluster and a cluster center is defined for each such cluster which is representative of the
images in the cluster. In an embodiment of the method of organizing the images in the database
according to the invention, the images are clustered in a hierarchical way. The number of
images in the database is n and the similarities between all pairs of images are precomputed.
The calculation of the similarities between the candidate images in the database is carried out
using the same feature vector described above for the calculation of the similarity between
the query image and a candidate image, namely the color histograms of the relevant images.
However, a different type of feature vector may be used, since the process of clustering the
images in the database is not directly linked to the process of searching the database. The
hierarchical clustering is carried out as follows:

1. The n images in the database are placed in n distinct clusters; these clusters are
called leaf clusters and are indexed by {C_1, C_2, ..., C_n}. For the kth cluster,
the set E_k contains all the images contained in that cluster. For all leaf clusters,
E_k = {k} and the number of images is N_k = 1.
2. Two clusters C_k and C_l are selected for which the similarity with one another is
the largest. The calculation of the similarity between two clusters is described
below.
3. These two clusters are merged into a new cluster C_{n+1}. This reduces the number
of clusters by one. The set of images in this new cluster becomes E_{n+1} = {E_k
∪ E_l} and the number of images in this new cluster becomes N_{n+1} = N_k + N_l.
4. Steps 2 and 3 are repeated until the number of clusters has been reduced to a
required number or the largest similarity between the clusters has dropped
below a predetermined threshold.
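
The four steps above can be sketched in Python as follows. This is an illustration under assumptions: the similarity between two clusters is taken here as the average of the precomputed image similarities over all pairs of images in the union of the two clusters, images are identified by their index in the similarity matrix, and the function and parameter names are hypothetical.

```python
from itertools import combinations


def cluster_similarity(ek, el, sim):
    """Average image similarity over all pairs in the union of two clusters (assumed measure)."""
    members = sorted(ek | el)
    pairs = list(combinations(members, 2))
    return sum(sim[i][j] for i, j in pairs) / len(pairs)


def hierarchical_clustering(sim, required_clusters, threshold=float("-inf")):
    """Merge the two most similar clusters until a stop criterion is met.

    `sim` is a symmetric n x n matrix of precomputed image similarities.
    """
    # Step 1: every image starts in its own leaf cluster.
    clusters = [{k} for k in range(len(sim))]
    while len(clusters) > max(required_clusters, 1):
        # Step 2: find the pair of clusters with the largest similarity.
        best, a, b = max(
            (cluster_similarity(clusters[a], clusters[b], sim), a, b)
            for a, b in combinations(range(len(clusters)), 2)
        )
        if best < threshold:
            break  # step 4: the largest similarity dropped below the threshold
        # Step 3: merge the two clusters, reducing the number of clusters by one.
        merged = clusters[a] | clusters[b]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (a, b)]
        clusters.append(merged)
    return clusters
```

The sketch recomputes all cluster similarities in every round for clarity; since the clustering is done in advance of any query, this cost does not affect the waiting time of a user.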

Figure 2 shows a simple example of organizing images into clusters
according to the invention. The database contains 8 images, images 202 - 216. In step 1,
these images are put in 8 distinct clusters. In a first execution of step 2, out of all pairs from
among the 8 clusters, cluster 202 and cluster 206 appear to be the most similar with each
other. These two clusters are merged into a new cluster 218 in step 3. This results in a
situation with 7 clusters, namely clusters 218, 204, 208, 210, 212, 214 and 216. After a
second execution of step 2, it appears that from among all pairs of these 7 clusters, cluster
204 and cluster 208 have the highest similarity. These are merged into new cluster 220. In
repeated steps 2 and 3, clusters 212 and 214 are then merged into cluster 222, cluster 222
and cluster 216 are merged into cluster 224, and cluster 218 and cluster 220 are merged into
cluster 226. This results in a structure with 3 different clusters: cluster 226, cluster 210 and
cluster 224. Then in a last execution of step 2, it is established that clusters 226 and 210
have the highest similarity among these 3 clusters and are therefore merged into a new
cluster 228. The process of clustering is now stopped since it has reached a number of two
distinct clusters, which was the required number in this example. For this example N_14 =
5, N_12 = 3, E_14 = {1,2,3,4,5} and E_12 = {6,7,8}.

In another embodiment of the method according to the invention, an
alternative clustering technique is chosen. This alternative clustering is carried out as
follows:

1. The number of clusters is chosen a priori. The centers are chosen by
randomly picking images from the database.

2. For each image in the database, the similarity measures between the image and
the cluster centers are computed and the image is assigned to the cluster with
which it exhibits the largest similarity measure.

3. New cluster centers are computed as the centroids of the clusters. 

4. Steps 2 and 3 are repeated until there is no further change in the cluster centers. 
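
A minimal Python sketch of this alternative clustering is given below. Several details are assumptions made for the example and are not specified in the steps above: each image is represented by one normalized color histogram (a list of floats), the similarity measure is histogram intersection, an empty cluster keeps its previous center, and the names `cluster_by_centers`, `histogram_intersection`, `centroid`, `seed` and `max_iterations` are hypothetical.

```python
import random


def histogram_intersection(h1, h2):
    """Assumed similarity measure: sum of bin-wise minima of two histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def centroid(histograms):
    """Bin-wise average of a non-empty list of histograms."""
    count = len(histograms)
    return [sum(bins) / count for bins in zip(*histograms)]


def cluster_by_centers(histograms, num_clusters, seed=0, max_iterations=100):
    """Return (assignment, centers) for the four-step procedure above."""
    rng = random.Random(seed)
    # Step 1: the initial centers are images picked at random.
    centers = [list(h) for h in rng.sample(histograms, num_clusters)]
    assignment = []
    for _ in range(max_iterations):
        # Step 2: assign every image to the center it is most similar to.
        assignment = [
            max(range(num_clusters),
                key=lambda c: histogram_intersection(h, centers[c]))
            for h in histograms
        ]
        # Step 3: recompute every center as the centroid of its cluster.
        new_centers = []
        for c in range(num_clusters):
            assigned = [h for h, a in zip(histograms, assignment) if a == c]
            new_centers.append(centroid(assigned) if assigned else centers[c])
        # Step 4: stop once the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return assignment, centers
```

The `max_iterations` guard is a safety limit added for the sketch; the procedure as described simply repeats until the centers are stable.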

In an embodiment of the method for organizing images according to the
invention, the measure of similarity between two clusters C_k and C_l is defined in terms of the
similarity measures of the images that are contained in those clusters using the following
equation:

S_{k,l} = \frac{1}{P_{(N_k + N_l)}} \sum_{i,j \in \{E_k \cup E_l\},\ i < j} s_{i,j}    (10)

In which:

S_{k,l} is the similarity measure between the clusters C_k and C_l,
E_k is the set of images in cluster C_k,
E_l is the set of images in cluster C_l,
s_{i,j} is the similarity measure between two images i and j, and
P_{(N_k + N_l)} is the number of pairs of images in the combination of clusters C_k and C_l.

So S_{k,l} is defined to be the average similarity between all pairs of images
that will be present in the cluster obtained by merging C_k and C_l. This ensures that when two
clusters are merged, the resulting cluster has the largest similarity between all images in
those two clusters. Since the similarity between clusters is defined in terms of the similarity
measures between the images in the clusters, there is no need to compute the cluster centers
every time two clusters are merged.

The number P_n of pairs of images in a cluster with n images is calculated using
the following equation:

P_n = \frac{n(n - 1)}{2}    (11)
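
As a worked illustration of equations (10) and (11), consider two cases. The node numbering is that of the Figure 2 example, with E_9 = {1,3} and E_10 = {2,4}; the symbols s_{i,j} stand for whatever image similarities the chosen measure yields, so no numerical values are assumed here.

```latex
% Two leaf clusters C_k and C_l: a single pair, so the cluster similarity
% reduces to the image similarity.
P_{2} = \frac{2 \cdot 1}{2} = 1, \qquad S_{k,l} = s_{k,l}

% Clusters C_9 and C_10 of the Figure 2 example: four images, six pairs.
P_{4} = \frac{4 \cdot 3}{2} = 6, \qquad
S_{9,10} = \frac{s_{1,2} + s_{1,3} + s_{1,4} + s_{2,3} + s_{2,4} + s_{3,4}}{6}
```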



When two clusters C_k and C_l are merged into a new cluster C_m, it is
necessary to calculate the similarity of this cluster with all other clusters. This can be done
using equation (10). However, this calculation is computationally intensive and a faster,
recursive calculation can be done. For any given cluster C_t, the similarity between the cluster
C_m and the cluster C_t is recursively calculated using the following equation:

S_{m,t} = \frac{P_{(N_k + N_l)} S_{k,l} + P_{(N_k + N_t)} S_{k,t} + P_{(N_l + N_t)} S_{l,t} - P_{N_k} S_{k,k} - P_{N_l} S_{l,l} - P_{N_t} S_{t,t}}{P_{(N_k + N_l + N_t)}}    (12)

In which:

S_{m,t} is the similarity measure between the clusters C_m and C_t, and
P_x is the number of pairs of images from among x images.

At the beginning of the clustering, for all leaf clusters S_{k,l} is set equal to s_{k,l} and S_{k,k} is set
equal to zero.
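
The recursive update can be checked numerically against the direct definition of equation (10). The sketch below assumes that the recursion takes the inclusion-exclusion form that follows algebraically from equations (10) and (11): the sum of pairwise similarities over three clusters equals the sums over the three pairs of clusters minus the sums within each single cluster. The function names and argument order are hypothetical, and S_kk denotes the average pairwise similarity within cluster C_k (zero for a leaf cluster).

```python
from itertools import combinations


def pair_count(n):
    """Equation (11): number of unordered pairs among n images."""
    return n * (n - 1) // 2


def direct_similarity(images, s):
    """Equation (10) evaluated directly; 0.0 for a single image."""
    pairs = list(combinations(sorted(images), 2))
    if not pairs:
        return 0.0
    return sum(s[i][j] for i, j in pairs) / len(pairs)


def merged_similarity(n_k, n_l, n_t, S_kl, S_kt, S_lt, S_kk, S_ll, S_tt):
    """Similarity between C_m (the merger of C_k and C_l) and C_t.

    Assumed recursion: each P * S product is a sum of pairwise image
    similarities, combined by inclusion-exclusion over the three clusters.
    """
    total = (pair_count(n_k + n_l) * S_kl
             + pair_count(n_k + n_t) * S_kt
             + pair_count(n_l + n_t) * S_lt
             - pair_count(n_k) * S_kk
             - pair_count(n_l) * S_ll
             - pair_count(n_t) * S_tt)
    return total / pair_count(n_k + n_l + n_t)
```

Under this form only stored cluster-level values are needed for the update, and the within-cluster similarity of the merged cluster C_m is simply S_kl.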

In the above embodiment of the method for organizing images according
to the invention, the cluster center of a cluster is defined as the average of the color
histograms of a number of representative images in that cluster. These images are selected in
such a way that the cluster center computed from them is close to all the images in the
cluster. The tree structure that was obtained as a by-product of the hierarchical clustering
algorithm is effectively used to select the set of representative images. In the explanation
below, a subcluster is a cluster that has been merged with another cluster to form part of a
resulting, larger cluster. In the example of Figure 2, the representative images for cluster C_14
are selected with the following considerations. From the tree structure it can be inferred that
the images 1 and 3 belong to one subcluster and images 2 and 4 belong to another subcluster.
Hence a good selection of representative images, if the number r of representative images is
3, is to select one from {1,3}, another from {2,4}, and 5. If the number r = 2, then it is apt
to select one from {1,2,3,4} and 5 as representative images. Similarly for C_12, it is better to
select 6 and 8 or 7 and 8 instead of 6 and 7. A selection according to the above considerations
results in a representative set that captures the diversity of images present in a cluster.

In general, for a cluster C_i a representative set of r images is selected as
described hereafter. A set of r subclusters is chosen from the tree associated with C_i and
from each of these subclusters a representative image is selected, resulting in r representative
images. This procedure includes the following steps:

1. Set n = 0 and form a set R_n = {i}. If r = 1, then go to step 5.

2. Each element in R_n is an index of a subcluster. Find an element k such that N_k
is the largest, i.e. find the subcluster at the next level of the hierarchy with the
largest number of images in it.

3. Form a new set R_{n+1} by copying all the elements of R_n except k and adding the
right child of C_k and the left child of C_k.

4. Repeat steps 2 and 3 until the number of elements contained in R_n is equal to r.

5. Now R_n contains r subclusters from the tree associated with C_i. From each of
these subclusters a representative image is chosen. If k is an element of R_n and
C_k is a leaf cluster, then N_k = 1 and the selection is straightforward, i.e. the
image associated with the leaf is selected. If C_k is not a leaf, i.e. N_k > 1, then
it is necessary to select a single image from E_k. This is done by selecting the
one that has the maximum average similarity measure with the other N_k - 1
images of E_k.
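
The five steps can be sketched in Python as follows. The data layout is an assumption made for the example: `children` maps each non-leaf cluster index to its (left, right) children, `members` maps each cluster index to its image set, leaf cluster k holds image k, and `s[i][j]` is the image similarity. The function name `select_representatives` is hypothetical, and the early exit for clusters with fewer than r images is an addition not covered by the steps.

```python
def select_representatives(root, r, children, members, s):
    """Pick r representative images for the cluster with index `root`."""
    subclusters = [root]                      # step 1: R_0 = {root}
    while len(subclusters) < r:               # step 4: repeat until r entries
        splittable = [k for k in subclusters if k in children]
        if not splittable:
            break                             # cluster holds fewer than r images
        # Step 2: the subcluster with the largest number of images.
        k = max(splittable, key=lambda c: len(members[c]))
        # Step 3: replace it by its two children.
        left, right = children[k]
        subclusters.remove(k)
        subclusters += [left, right]
    representatives = []
    for k in subclusters:                     # step 5: one image per subcluster
        images = sorted(members[k])
        if len(images) == 1:
            representatives.append(images[0])
        else:
            def average_similarity(i):
                others = [j for j in images if j != i]
                return sum(s[i][j] for j in others) / len(others)
            representatives.append(max(images, key=average_similarity))
    return representatives
```

With the tree of the Figure 2 example and r = 2, the loop stops at the subclusters 13 and 5, so one image is drawn from {1,2,3,4} and image 5 is taken directly.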

For the example shown in Figure 2, finding a representative set of images for C_14 with r =
2 begins with R_0 = {14}. Then R_1 = {13,5} and the iteration stops here as R_1 contains two
elements already. Now, since C_5 is a leaf cluster with a single element, image 5 is chosen as
one representative. Another representative is selected from C_13, which contains four images
{1,2,3,4}, by calculating the average similarity of each image with the other three images and
then choosing the one with the maximum average similarity. Assuming that image 2 has the
largest average similarity, the representative set of images for cluster C_14 is {2,5}. After
selecting a set of r representative images, the averages of their corresponding histograms are
used to represent the cluster center. In the embodiment of the invention where multiple
histograms per image are employed, i.e. a histogram per region of an image in order to
capture spatial information, the cluster center is represented by multiple histograms which
are obtained by averaging the corresponding regions of the representative images. So, if in
the example above the image would be divided into 16 regions, the cluster center for C_14 is
represented by 16 histograms which are obtained by averaging the corresponding histograms
of the regions of images 2 and 5.

A cluster center can be computed for each of the clusters, not being a
subcluster, to represent the images in the cluster. After computing the cluster center, the
optimality of the clustering and the computation of clusters is evaluated. Cluster centers are
optimal if, for each image contained in a cluster, the similarity measure between the image
and the center of that cluster is larger than the respective similarity measures between the
image and all other cluster centers. This may not be true, especially given the fact that only a
representative set of images is used to compute the cluster centers and not all the images in
the cluster. As a result, an image may have a larger similarity measure with another
cluster center than with its own cluster center. All such images are moved to their closest
clusters to optimize the cluster centers. This cluster center optimization is carried out as
follows:

1. For each of the n images in the database, the similarity measures between the
image and all cluster centers are determined. If the cluster with the maximum
similarity is the same cluster in which the image is present, then nothing is done.
If not, the image is moved from the cluster in which it resides to the cluster that
it is most similar to. The trees of both clusters are rearranged to reflect this
removal and addition as described below.

2. The cluster centers of the relevant clusters are recomputed. These are the
clusters from which an image has been removed or to which an image has been
added.

3. Steps 1 and 2 are repeated until the number of images to be moved is below a
threshold. Then, these images are moved as in step 1, but the cluster centers are
not recomputed as in step 2.

At the end of step 3, all images exhibit the largest similarity measure with the center of the
cluster in which they are present and hence the clustering is optimal.
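
The optimization loop can be sketched in Python as shown below. To keep the example short it departs from the text in two labeled ways: a cluster center is recomputed as the average histogram of all images in the cluster rather than of a representative set, and the trees are not rearranged. Histogram intersection as the similarity measure and the names `optimize_centers`, `move_threshold` and `max_rounds` are likewise assumptions.

```python
def histogram_intersection(h1, h2):
    """Assumed similarity measure: sum of bin-wise minima of two histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def mean_histogram(histograms):
    """Bin-wise average of a non-empty list of histograms."""
    count = len(histograms)
    return [sum(bins) / count for bins in zip(*histograms)]


def optimize_centers(histograms, assignment, centers,
                     move_threshold=1, max_rounds=100):
    """Move images to their most similar center; return (assignment, centers)."""
    assignment = list(assignment)
    centers = [list(c) for c in centers]
    for _ in range(max_rounds):
        # Step 1: find every image that is more similar to another center.
        moves = []
        for i, h in enumerate(histograms):
            scores = [histogram_intersection(h, c) for c in centers]
            best = max(range(len(centers)), key=scores.__getitem__)
            if scores[best] > scores[assignment[i]]:
                moves.append((i, assignment[i], best))
        for i, old, new in moves:
            assignment[i] = new
        # Step 3: once few images move, stop without recomputing the centers.
        if len(moves) < move_threshold:
            break
        # Step 2: recompute only the centers of clusters that lost or gained images.
        affected = {c for i, old, new in moves for c in (old, new)}
        for c in affected:
            assigned = [h for h, a in zip(histograms, assignment) if a == c]
            if assigned:
                centers[c] = mean_histogram(assigned)
    return assignment, centers
```

With `move_threshold=1` the loop ends only when no image needs to move, which corresponds to the optimal situation described above.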

When an image is removed from a cluster, the associated tree of
subclusters and images is updated. When an image is removed, there are two possible
scenarios depending on whether the removed node is a direct child of the root of the cluster
or not. Figure 3 shows an example of the removal of an image which is a direct child of the
root of the cluster. This shows the removal of image 5 in the example cluster of Figure 2.
When image 5 is removed, the node 13 becomes redundant and hence it is removed and
replaced with node 14. Now N_14 = 4, E_14 = {1,2,3,4}, the right child RC_14 = 10 and the
left child LC_14 = 9. Figure 4 shows an example of the removal of an image which is not a
direct child of the root. This shows the removal of image 2 in the example cluster of Figure
2. When image 2 is removed, node 10 will have only one child and hence it is removed.
Image 4 then becomes a child of node 13. Now N_13 = 3, E_13 = {1,3,4}, the right child RC_13
= 4 and the left child LC_13 = 9. And for C_14, N_14 = 4, E_14 = {1,3,4,5}, the right child RC_14
= 5 and the left child LC_14 = 13.
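
Both removal scenarios can be sketched in Python as follows. The layout is an assumption made for the example: `children` maps a non-leaf node to its (left, right) children, `members` maps a node to its image set, leaf node k holds image k, and the image is assumed to be present in the tree. The function name `remove_image` is hypothetical, and a cluster that would be left with a single image is not covered by the description, so the sketch rejects that case.

```python
def remove_image(root, image, children, members):
    """Remove leaf `image` from the tree rooted at `root` (dicts updated in place)."""
    # Walk down from the root to the parent of the leaf.
    path = [root]
    while image not in children[path[-1]]:
        left, right = children[path[-1]]
        path.append(left if image in members[left] else right)
    parent = path[-1]
    left, right = children[parent]
    sibling = right if left == image else left
    if parent == root and sibling not in children:
        raise ValueError("sketch does not cover a cluster reduced to one image")
    # Update the image sets (and thus the counts) along the path.
    for k in path:
        members[k].discard(image)
    if parent == root:
        # Direct child of the root: the sibling node becomes redundant and
        # the root takes over the sibling's children.
        children[root] = children.pop(sibling)
        del members[sibling]
    else:
        # Otherwise the parent is left with one child and is removed; the
        # sibling moves up to take the parent's place.
        grandparent = path[-2]
        g_left, g_right = children[grandparent]
        if g_left == parent:
            children[grandparent] = (sibling, g_right)
        else:
            children[grandparent] = (g_left, sibling)
        del children[parent]
        del members[parent]
```

Applied to the tree of Figure 2, removing image 5 leaves node 14 with children 9 and 10, and removing image 2 leaves node 13 with children 9 and 4, matching the two examples above.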

Figure 5 shows an example of the addition of an image to a cluster. When
adding the image to a cluster, it is necessary to decide where to insert it as a new node in the
tree. This decision is made top-to-bottom. If an image is to be added to cluster C_k with right
child C_i and left child C_j, then first E_k and N_k are updated to reflect the addition of the node
to the tree of C_k. The decision of whether to add the image to C_i or C_j is based on the
similarity measure. The image is added to C_i if the average similarity of the image with all
the images in C_i is larger than the average similarity of the image with all the images in C_j,
and vice versa. If the node is added to C_i, the parameters associated with C_i are updated. The
image is then added to either the right or left child of C_i. This process is repeated recursively
until the leaves of the tree are reached. Figure 5 shows the addition of image 2 to cluster C_12
of Figure 2. First the parameters associated with C_12 are updated: N_12 = 4 and E_12 =
{6,7,8,2}. Then to decide whether image 2 is to be added to C_11 or C_8, the similarity
measure s_{2,8} is compared with (s_{2,6} + s_{2,7})/2. If the latter is larger, then E_11 is updated to
{6,7,2} and N_11 is updated to 3. Then s_{2,6} and s_{2,7} are compared and if s_{2,7} is larger, a new
node C_15 is created with right child RC_15 = 2 and left child LC_15 = 7. Also the right child of
C_11 is updated, RC_11 = 15.
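
The top-to-bottom insertion can be sketched in Python as follows, using the same assumed layout as elsewhere in these examples: `children` maps a non-leaf node to its (left, right) children, `members` maps a node to its image set, leaf node k holds image k, and `s[i][j]` is the image similarity. The function name `add_image`, the caller-supplied `next_index` for the new node, and the choice of the left child on a tie are assumptions.

```python
def add_image(root, image, children, members, s, next_index):
    """Insert `image` into the tree rooted at `root`; return the new node index."""
    def average_similarity(cluster):
        images = members[cluster]
        return sum(s[image][j] for j in images) / len(images)

    node, parent = root, None
    while node in children:
        # Update the parameters of the current cluster, then descend into
        # the child whose images are on average more similar to the new image.
        members[node].add(image)
        left, right = children[node]
        parent = node
        if average_similarity(right) > average_similarity(left):
            node = right
        else:
            node = left
    # `node` is now a leaf: create a new node holding the leaf and the image.
    members[next_index] = members[node] | {image}
    members.setdefault(image, {image})
    children[next_index] = (node, image)      # left: old leaf, right: new image
    if parent is not None:
        left, right = children[parent]
        if left == node:
            children[parent] = (next_index, right)
        else:
            children[parent] = (left, next_index)
    return next_index
```

For the Figure 5 example, calling the function with root 12, image 2 and `next_index` 15 reproduces the described result when s_{2,7} exceeds s_{2,6} and their average exceeds s_{2,8}: node 15 has left child 7 and right child 2, and it becomes the right child of node 11.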

Figure 6 shows the most important components of the image retrieval
system according to the invention. The image retrieval system 600 is implemented according
to a known architecture and can be realized on a general purpose computer. The image
retrieval system has a processor 602 for carrying out instructions of an application program
loaded into working memory 604. The image retrieval system further has an interface 606
for communication with peripheral devices. There is a bus 608 for exchange of commands
and data between the various components of the system. The peripherals of the image
retrieval system include a storage medium 610 containing the executable programs, the
database with images, and various other data. The storage medium 610 can be realized as
various separate devices, potentially of different kinds. Application of the
invention is not restricted by the type of device, and storage devices which can be used
include optical disc, magnetic disc, tape, chip card, solid state or some combination of these
devices. Furthermore, some of the data or images may be at a remote location and the image
retrieval system may be connected to such a location by a network via connection 611. The
peripherals of the image retrieval system further include a display 612 on which the system
displays, amongst others, the query image and the candidate images. Furthermore the
peripherals preferably include a selection device 614 and a pointing device 616 with which
the user can move a cursor on the display. Devices 614 and 616 can be integrated into one
selecting means 618 like a computer mouse with one or more selection buttons. However,
other devices like a track ball, graphic tablet, joystick, or touch sensitive display are also
possible. In order to carry out the various tasks, a number of software units are loaded into
the working memory 604. An entry unit 122 enables the user to enter the query image into
the system. A cluster comparison unit 128 is arranged for comparing the query image with
the cluster centers of the clusters with images and for computing respective cluster
similarities. The selection unit 132 is arranged to select the clusters of which the cluster
centers exhibit the highest similarity with the query image. The image comparison unit 136 is
for comparing the query image with the images in the selected clusters and for computing the
image similarities between the query image and those images. The presentation unit 140 is
arranged for presenting the images in a ranked order with respect to their image similarity,
so that the user is shown the images with the highest image similarity first. The distribution
of the system's functionality over the various software units may be implemented in a
different way than described above. Some units may be combined or other units may be
used to realize a certain task. Furthermore, the working memory 604 has memory space 620
for temporarily storing input and output data and intermediate results, like the respective
histograms and the determined similarity.
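
The cooperation of the cluster comparison, selection, image comparison and presentation units can be sketched in Python as a two-stage search. The sketch rests on assumptions that are not part of the description: each cluster is a dictionary with a `center` histogram and an `images` mapping from an image identifier to its histogram, histogram intersection serves as the similarity measure, and the names `retrieve`, `num_clusters` and `num_results` are hypothetical.

```python
def histogram_intersection(h1, h2):
    """Assumed similarity measure: sum of bin-wise minima of two histograms."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def retrieve(query, clusters, num_clusters=1, num_results=5):
    """Return up to num_results (image_id, similarity) pairs, best first."""
    # Cluster comparison and selection: rank the clusters by the similarity
    # of their centers to the query and keep the best ones.
    ranked_clusters = sorted(
        clusters,
        key=lambda c: histogram_intersection(query, c["center"]),
        reverse=True,
    )
    selected = ranked_clusters[:num_clusters]
    # Image comparison: only images in the selected clusters are compared.
    scored = [
        (histogram_intersection(query, histogram), image_id)
        for cluster in selected
        for image_id, histogram in cluster["images"].items()
    ]
    # Presentation: ranked order, highest image similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(image_id, score) for score, image_id in scored[:num_results]]
```

Because only the cluster centers and the images of the selected clusters are compared with the query, the number of comparisons is smaller than in a search over the complete database.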

Figure 7 shows the most important components of the system for
organizing the images according to the invention. The system 700 for organizing the images
is implemented according to a known architecture and can be realized on a general purpose
computer. The system 700 has a processor 702 for carrying out instructions of an application
program loaded into working memory 704. The system 700 further has an interface 706 for
communication with peripheral devices. There is a bus 708 for exchange of commands and
data between the various components of the system. The peripherals of the image retrieval
system include a storage medium 710 containing the executable programs, the database with
images, and various other data. The storage medium 710 can be realized as various separate
devices, potentially of different kinds. Application of the invention is not
restricted by the type of device, and storage devices which can be used include optical disc,
magnetic disc, tape, chip card, solid state or some combination of these devices. In an
embodiment of the system for organizing images according to the invention, the images that
are to be organized reside on a hard disk and the resulting database also resides on this hard
disk. However, the images may be transferred into the system in another way, e.g. via a
tape. Furthermore, some of the data or images may be at a remote location and the system
may be connected to such a location by a network via connection 712. The system 700 has a
clustering unit 714 loaded into the working memory for defining the clusters whereby
mutually similar images are grouped into a same cluster. Furthermore, the system 700 has a
center determining unit 716 for determining the cluster centers for the respective clusters.

CLAIMS:



1. An image retrieval system comprising:

a database with clusters, each cluster comprising a respective set of candidate
images and a cluster center which is representative for that set;

entry means for entering a query image;

cluster comparison means for comparing the query image with respective
cluster centers to establish respective cluster similarities between the query image and the
respective clusters;

selection means for selecting at least the cluster with the largest cluster
similarity with the query image;

image comparison means for comparing the query image with the
candidate images in the selected clusters to establish respective image similarities between the
query image and the respective candidate images; and

presentation means for presenting at least the candidate image with the
largest image similarity.

2. An image retrieval system according to Claim 1, in which at least one of
the cluster centers is represented by a color histogram which is the average of respective 
color histograms of a number of representative images in the particular cluster. 

3. A method for retrieving images from a database comprising clusters, each
cluster comprising a respective set of candidate images and a cluster center which is
representative for that set, the method comprising the steps of:

inputting a query image;

comparing the query image with respective cluster centers to establish
respective cluster similarities between the clusters and the query image;

selecting at least the cluster with the largest cluster similarity with the
query image;

comparing the query image with respective candidate images of the
selected clusters to establish respective image similarities between these candidate images and
the query image; and

presenting at least the candidate image with the largest image similarity.

4. A method for organizing images in a database, the method comprising the
steps of:

defining clusters each comprising a subset of the images, whereby the
images in a cluster are similar with each other and whereby at least one of the clusters
comprises more than one image, and

determining a cluster center for each of the clusters.

5. A method according to Claim 4, further comprising the step of 
determining the similarity between each respective cluster and each other one of the clusters, 
wherein 

the step of defining the cluster includes merging the two clusters with the 
largest mutual similarity into one new cluster, and 

the step of determining a cluster center includes determining a cluster 
center for the new cluster. 

6. A method according to Claim 5, wherein the similarity between two 
clusters is determined on the basis of the average of the similarities between all pairs of 
images in the two clusters. 

7. A method according to Claim 5, wherein the cluster center of the new 
cluster is determined on the basis of images selected from respective ones of the two clusters 
that had been selected for merging into the new cluster. 

8. A method according to Claim 4, further comprising a cluster center 
optimization step including: 

determining the similarity between at least one of the images and each of 
the cluster centers and 

if that image has a larger similarity with the cluster center of another
cluster than with the cluster center of its own cluster, moving that image to that other
cluster.

9. A system for organizing images in a database, the system comprising:

clustering means for defining clusters each comprising a subset of the
images, whereby the images in a cluster are similar with each other and whereby at least one
of the clusters comprises more than one image, and

center determining means for determining a cluster center for each of the
clusters.

10. A database with a plurality of images, the database comprising:

clusters each comprising a subset of the images, whereby the images in a
cluster are similar with each other and whereby at least one of the clusters comprises more
than one image, and

a cluster center for each of the clusters.